
feat: run file hash algorithms in parallel #3636

Draft · wants to merge 6 commits into main
Conversation

@kzantow (Contributor) commented Jan 31, 2025

Description

This change plumbs through the parallelism config to the file hasher, so files are hashed in parallel using up to the number of threads specified in the parallelism config.

This is related to: #3266

Type of change

  • Performance (make Syft run faster or use less memory, without changing visible behavior much)

Checklist:

  • I have added unit tests that cover changed behavior
  • I have tested my code in common scenarios and confirmed there are no regressions
  • I have added comments to my code, particularly in hard-to-understand sections

@popey (Contributor) commented Feb 6, 2025

I pulled this, built it locally, and then tested it with a few containers. Maybe I'm doing it wrong, but I don't see a positive difference. Is it only going to benefit certain use cases or container types?

I used three different containers, and ran both syft v1.19.0 and v1.19.0 with this patch applied (v1.19.0-pfh), with increasing SYFT_PARALLELISM settings. The remote Debian amd64 box has a Ryzen 5 3600 CPU (6 cores / 12 threads).

SYFT_PARALLELISM makes little to no difference, and the release build is faster than this PR. It looks to me like most of the time is spent unpacking the docker container into /tmp by stereoscope, and in these cases, there is minimal time building the SBOM.

Is there a better test I could do?

docker.io/nextcloud:latest

| program | release | duration (s) | SYFT_PARALLELISM |
| --- | --- | --- | --- |
| syft | v1.19.0-pfh | 30 | 1 |
| syft | v1.19.0-pfh | 23 | 2 |
| syft | v1.19.0-pfh | 20 | 3 |
| syft | v1.19.0-pfh | 20 | 4 |
| syft | v1.19.0-pfh | 19 | 5 |
| syft | v1.19.0 | 21 | 1 |
| syft | v1.19.0 | 21 | 2 |
| syft | v1.19.0 | 21 | 3 |
| syft | v1.19.0 | 20 | 4 |
| syft | v1.19.0 | 21 | 5 |

docker.io/opensearchproject/opensearch:latest

| program | release | duration (s) | SYFT_PARALLELISM |
| --- | --- | --- | --- |
| syft | v1.19.0-pfh | 50 | 1 |
| syft | v1.19.0-pfh | 39 | 2 |
| syft | v1.19.0-pfh | 39 | 3 |
| syft | v1.19.0-pfh | 33 | 4 |
| syft | v1.19.0-pfh | 34 | 5 |
| syft | v1.19.0 | 32 | 1 |
| syft | v1.19.0 | 33 | 2 |
| syft | v1.19.0 | 35 | 3 |
| syft | v1.19.0 | 33 | 4 |
| syft | v1.19.0 | 33 | 5 |

docker.io/pytorch/pytorch:latest

| program | release | duration (s) | SYFT_PARALLELISM |
| --- | --- | --- | --- |
| syft | v1.19.0-pfh | 84 | 1 |
| syft | v1.19.0-pfh | 78 | 2 |
| syft | v1.19.0-pfh | 79 | 3 |
| syft | v1.19.0-pfh | 81 | 4 |
| syft | v1.19.0-pfh | 78 | 5 |
| syft | v1.19.0 | 79 | 1 |
| syft | v1.19.0 | 78 | 2 |
| syft | v1.19.0 | 78 | 3 |
| syft | v1.19.0 | 79 | 4 |
| syft | v1.19.0 | 78 | 5 |

@kzantow (Contributor, Author) commented Feb 6, 2025

@popey -- hmm... I wouldn't really expect to see much difference for smaller images (though these aren't exactly "smaller" images); it's also dependent on the number of hashers you're using: more hashers should show a bigger difference. A reasonable parallelism setting could be based on the number of CPU cores available, maybe num cores * 2. The higher the parallelism, the more concurrent IO the system is able to do.

It's probably worth comparing apples to apples -- testing this vs. main, since there are a number of commits. I just updated this branch to current main.

@popey (Contributor) commented Feb 6, 2025

Ok, I'll re-run with the new update and a larger degree of parallelism. What's your definition of "small" in image terms?
I was going to try one of the huggingface ones, but it exploded disk space unpacking it.

The ones I'm currently using are this kinda size...

 ✔ Cataloged contents a52af642fd0f5e4957ce42fa27b77dd0c898223e32dbf8266664bf261[38/3014]
   ├── ✔ Packages                        [417 packages]
   ├── ✔ File metadata                   [10,573 locations]
   ├── ✔ Executables                     [1,318 executables]
   └── ✔ File digests                    [10,573 files]
 ✔ Cataloged contents 0246c8b2d4b494dc7c4776dff472ba765be63a71bd9d8d878ac9c40f573379bb
   ├── ✔ Packages                        [859 packages]
   ├── ✔ Executables                     [378 executables]
   ├── ✔ File metadata                   [6,426 locations]
   └── ✔ File digests                    [6,426 files]
 ✔ Cataloged contents 11691e035a3651d25a87116b4f6adc113a27a29d8f5a6a583f8569e0ee5ff897
   ├── ✔ Packages                        [224 packages]
   ├── ✔ File digests                    [4,394 files]
   ├── ✔ File metadata                   [4,394 locations]
   └── ✔ Executables                     [1,341 executables]

@kzantow (Contributor, Author) commented Feb 6, 2025

For some reason, I didn't look at the image names 🤦 This change is a lot less about the number of files and more about the total bytes to process -- what are the sizes in GB?

Uncompressed sizes are:
nextcloud: ~400 MB
opensearchproject/opensearch: ~900 MB
pytorch/pytorch: ~3.4 GB

@popey (Contributor) commented Feb 6, 2025

| Image | Size |
| --- | --- |
| docker.io/nextcloud:latest | 1.29GB |
| docker.io/opensearchproject/opensearch:latest | 1.12GB |
| docker.io/pytorch/pytorch:latest | 7.6GB |

Are these too small? I went for something a little bigger:

| Image | Size |
| --- | --- |
| docker.io/huggingface/transformers-all-latest-torch-nightly-gpu:latest | 21.3GB |

docker.io/nextcloud:latest

| program | release | duration (s) | SYFT_PARALLELISM |
| --- | --- | --- | --- |
| syft | v1.19.0-pfh | 30 | 1 |
| syft | v1.19.0-pfh | 19 | 24 |
| syft | v1.19.0 | 21 | 1 |
| syft | v1.19.0 | 21 | 24 |

docker.io/opensearchproject/opensearch:latest

| program | release | duration (s) | SYFT_PARALLELISM |
| --- | --- | --- | --- |
| syft | v1.19.0-pfh | 51 | 1 |
| syft | v1.19.0-pfh | 33 | 24 |
| syft | v1.19.0 | 32 | 1 |
| syft | v1.19.0 | 32 | 24 |

docker.io/pytorch/pytorch:latest

| program | release | duration (s) | SYFT_PARALLELISM |
| --- | --- | --- | --- |
| syft | v1.19.0-pfh | 82 | 1 |
| syft | v1.19.0-pfh | 79 | 24 |
| syft | v1.19.0 | 81 | 1 |
| syft | v1.19.0 | 78 | 24 |

docker.io/huggingface/transformers-all-latest-torch-nightly-gpu:latest

| program | release | duration (s) | SYFT_PARALLELISM |
| --- | --- | --- | --- |
| syft | v1.19.0-pfh | 418 | 1 |
| syft | v1.19.0-pfh | 232 | 24 |
| syft | v1.19.0 | 266 | 1 |
| syft | v1.19.0 | 258 | 24 |

Hm, this is weird. I see the pfh PR has better times when using high parallelism, but I'm not quite sure why the v1.19.0 runs are faster than the pfh ones to start with!?

@kzantow (Contributor, Author) commented Feb 6, 2025

Are you still using the v1.19.0 release instead of something built off main? I don't know exactly what v1.19.0-pfh is -- the branch in this PR? this branch merged with v1.19.0?

@popey (Contributor) commented Feb 6, 2025

v1.19.0-pfh is this PR, rebuilt a couple of hours ago after this PR was updated. Maybe poorly named -- it's just this PR.
v1.19.0 is the release binary we put on GitHub.

@kzantow (Contributor, Author) commented Feb 6, 2025

@popey right, so it does not include all the other changes on main; it should be compared against main.

@popey (Contributor) commented Feb 6, 2025

Maybe, but I'm looking at this more from a user perspective: what will 1.20 (or whatever it's called) look like compared to 1.19?
I will run this again, comparing the PR with the tip of main to get a more isolated comparison.

@popey (Contributor) commented Feb 7, 2025

I re-ran my tests on this PR using measure-syft. It ran against this PR and main five times each. The summary is below, and specific details from the logs are further down. Looks great!

Syft Performance Test Results

Date: 2025-02-07 15:31:21
Container: docker.io/huggingface/transformers-all-latest-torch-nightly-gpu:latest
Environment Variables:

  • SYFT_CHECK_FOR_APP_UPDATE=false
  • SYFT_PARALLELISM=24

Results

| Version/Description | Commit | Min (s) | Max (s) | Avg (s) |
| --- | --- | --- | --- | --- |
| main | - | 251.18 | 258.07 | 253.75 |
| feat/parallelize-file-hashing | - | 229.86 | 233.74 | 231.25 |

Logs snippets

Main

$ grep file-digest-cataloger results/logs/syft_e584c9f4_run*
results/logs/syft_e584c9f4_run1_2025-02-07_145054.log:[0252]  INFO task completed elapsed=37.945572765s task=file-digest-cataloger
results/logs/syft_e584c9f4_run2_2025-02-07_145512.log:[0245]  INFO task completed elapsed=37.784157195s task=file-digest-cataloger
results/logs/syft_e584c9f4_run3_2025-02-07_145923.log:[0249]  INFO task completed elapsed=37.939367806s task=file-digest-cataloger
results/logs/syft_e584c9f4_run4_2025-02-07_150339.log:[0245]  INFO task completed elapsed=38.151224264s task=file-digest-cataloger
results/logs/syft_e584c9f4_run5_2025-02-07_150750.log:[0246]  INFO task completed elapsed=38.071557398s task=file-digest-cataloger

feat/parallelize-file-hashing

$ grep  file-digest-cataloger results/logs/syft_c5961b53_run*
results/logs/syft_c5961b53_run1_2025-02-07_151205.log:[0222]  INFO task completed elapsed=13.192338452s task=file-digest-cataloger
results/logs/syft_c5961b53_run2_2025-02-07_151555.log:[0224]  INFO task completed elapsed=12.47219267s task=file-digest-cataloger
results/logs/syft_c5961b53_run3_2025-02-07_151945.log:[0226]  INFO task completed elapsed=13.223577256s task=file-digest-cataloger
results/logs/syft_c5961b53_run4_2025-02-07_152337.log:[0227]  INFO task completed elapsed=13.637332118s task=file-digest-cataloger
results/logs/syft_c5961b53_run5_2025-02-07_152731.log:[0224]  INFO task completed elapsed=11.701824755s task=file-digest-cataloger
